This episode introduces learners to the basics of text-mining concepts and analysis using the corpus of the Arabian Nights downloaded from the Gutenberg Project. The script combines code developed for the Digital Methods for Historians course with student contributions from 2022. The workflow, functions, and toolkit build on Julia Silge and David Robinson's Text Mining with R, which is also recommended as the main reference.
The episode is best suited to intermediate learners of R who know about regular expressions.
First you will need to install and load the following packages:
library(tidyverse)
library(here)
library(tidytext)
library(textdata)
library(ggwordcloud)

The text of the Arabian Nights was downloaded from the Gutenberg
project using the gutenbergr package in script
08b-text-download.R, and then saved as .rds dataset. We now load that
dataset into R memory.
an_df <- readRDS("data/arabian.rds")

The dataframe contains a lot of text that isn't part of the stories that make up the Arabian Nights. To get more accurate results, we trim most of this excess text away by removing specific rows:
an_df_tidy <- an_df[-c(1:780, 1545:1844, 2268:2437, 3866:4536, 7995:9555, 9846:9913, 11869:12845, 15962:18806, 151902:189676, 148281:150158, 131077:133627, 114421:116373, 97810:99392, 83707:85877, 65031:66536, 48903:51794),]

Let's figure out the frequency of words in the Arabian Nights. The first thing we have to do is tokenize the dataframe, so that every word gets its own row:
an_tokens <- an_df_tidy %>%
unnest_tokens(word, text)

Now we can count how many times each word occurs and sort the words by frequency:
an_wc <- an_tokens %>%
count(word) %>%
arrange(-n)
an_wc %>% slice(1:100) %>% pull(word)

  [1] "and" "the" "of" "to" "he" "a" "i"
[8] "in" "him" "his" "with" "her" "my" "it"
[15] "me" "for" "that" "is" "she" "thou" "this"
[22] "was" "said" "thee" "when" "so" "o" "they"
[29] "but" "not" "on" "them" "by" "from" "as"
[36] "thy" "then" "who" "be" "had" "al" "all"
[43] "king" "which" "will" "at" "what" "allah" "have"
[50] "one" "we" "till" "hath" "now" "footnote" "day"
[57] "their" "out" "came" "night" "an" "there" "quoth"
[64] "upon" "no" "up" "or" "us" "were" "saying"
[71] "after" "say" "went" "man" "are" "these" "down"
[78] "shall" "do" "like" "son" "before" "saw" "made"
[85] "answered" "if" "took" "our" "two" "would"
[ reached getOption("max.print") -- omitted 10 entries ]
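The tokenize-and-count pattern above can be illustrated on a tiny, made-up corpus (the data frame toy_df below is invented purely for illustration):

```r
library(dplyr)
library(tidytext)

# A made-up two-line corpus standing in for the Arabian Nights text
toy_df <- tibble(text = c("The king spoke to the queen",
                          "The queen answered the king"))

toy_wc <- toy_df %>%
  unnest_tokens(word, text) %>%  # one lower-cased word per row
  count(word) %>%                # tally each distinct word
  arrange(-n)                    # most frequent first

toy_wc  # "the" tops the list with n = 4
```

Note that unnest_tokens() lower-cases and strips punctuation by default, which is why "The" and "the" end up counted together.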
Then we remove stop words using the stopword list bundled with tidytext:
an_stop <- an_tokens %>%
anti_join(stop_words)

Then we count and sort the words again:
an_swc <- an_stop %>%
count(word) %>%
arrange(-n)
an_swc %>% slice(1:100) %>% pull(word)

 [1] "thou" "thee" "thy" "al" "king" "allah"
[7] "till" "hath" "footnote" "day" "night" "quoth"
[13] "son" "answered" "replied" "love" "ceased" "heard"
[19] "hundred" "lord" "heart" "set" "reached" "hand"
[25] "found" "house" "city" "head" "dawn" "time"
[31] "hast" "wazir" "brought" "art" "father" "cried"
[37] "fell" "eyes" "arab" "auspicious" "shahrazad" "caliph"
[43] "left" "days" "perceived" "permitted" "brother" "life"
[49] "words" "lady" "woman" "returned" "told" "sat"
[55] "wife" "save" "whilst" "rose" "thousand" "daughter"
[61] "bring" "ye" "called" "presently" "door" "gold"
[67] "thine" "slave" "folk" "bade" "palace" "hands"
[73] "mother" "wilt" "leave" "din" "water" "death"
[79] "return" "fear" "behold" "land" "hasan" "entered"
[85] "abu" "couplets" "almighty" "dinars" "ali" "sight"
[ reached getOption("max.print") -- omitted 10 entries ]
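To see what anti_join() is doing here, consider a minimal sketch with an invented token list (stop_words is the combined English stopword table that ships with tidytext):

```r
library(dplyr)
library(tidytext)

data(stop_words)  # stopword lexicons bundled with tidytext

# Invented tokens for illustration
toy_tokens <- tibble(word = c("the", "king", "and", "palace", "of", "gold"))

# anti_join() keeps only rows whose word does NOT occur in stop_words
toy_kept <- toy_tokens %>%
  anti_join(stop_words, by = "word")

toy_kept$word
#> [1] "king"   "palace" "gold"
```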
Because the translation dates to the 19th century, there are a lot of archaic words that are not in the stopword list but should still be removed for practicality's sake. Therefore we create an additional stopword list:
my_stop_words <- data.frame(word = c("thou", "thee", "thy", "till", "hath", "quoth", "footnote", "answered", "replied", "set", "al", "arab", "ye"))

Now we use the newly created stopword list to remove these extra words from our an_stop dataframe:
an_stop_new <- an_stop %>%
anti_join(my_stop_words)

Then we count and sort the words again:
an_stop_new %>%
count(word) %>%
arrange(-n)

# A tibble: 33,991 × 2
word n
<chr> <int>
1 king 4953
2 allah 4219
3 day 3199
4 night 2962
5 son 2102
6 love 1653
7 ceased 1641
8 heard 1595
9 hundred 1574
10 lord 1564
# … with 33,981 more rows
Let’s now make a wordcloud with the most frequent words in the Arabian Nights. The first thing we want to do is to filter out any numbers from the text:
an_no_numeric <- an_stop_new %>%
filter(is.na(as.numeric(word)))

length(unique(an_no_numeric$word))

[1] 33123
By using this code we can see that there are over 33,000 unique words in the Arabian Nights. That is too many to fit into one wordcloud, so we limit the wordcloud to the 100 most frequent words.
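The is.na(as.numeric(word)) trick works because as.numeric() returns NA (with a coercion warning) for anything that isn't a number. A small sketch with invented tokens, using suppressWarnings() to silence those expected warnings:

```r
library(dplyr)

# Invented mix of word tokens and number tokens
toy <- tibble(word = c("night", "1001", "king", "40"))

# as.numeric() is NA exactly for the real words, so filter() keeps them
toy_no_numeric <- toy %>%
  filter(suppressWarnings(is.na(as.numeric(word))))

toy_no_numeric$word
#> [1] "night" "king"
```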
To limit the number of words in the wordcloud, do this:
an_top100 <- an_no_numeric %>%
count(word) %>%
arrange(-n) %>%
head(100)

Now we make a wordcloud based on the new dataframe, containing only the 100 most frequent words in the Arabian Nights:
ggplot(data = an_top100, aes(label = word, size = n)) +
geom_text_wordcloud_area(aes(color = n), shape = "diamond") +
scale_size_area(max_size = 12) +
scale_color_gradientn(colors = c("#72286F","darkred","#CF2129")) +
theme_minimal()
We can then see that the most frequent word is "king", but it is worth noting that words like "love" and "wife" are also on the list of the 100 most frequent words in the Arabian Nights; those are words worth taking a closer look at.
First we'll do a sentiment analysis on the Arabian Nights. We focus on the ten categories in the "nrc" lexicon (eight emotions plus positive and negative), which we have to load first:
get_sentiments(lexicon = "nrc")

# A tibble: 13,872 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# … with 13,862 more rows
Citation for NRC lexicon: Crowdsourcing a Word-Emotion Association Lexicon, Saif Mohammad and Peter Turney, Computational Intelligence, 29 (3), 436-465, 2013.
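Because get_sentiments("nrc") requires a one-time interactive download via the textdata package, here is a self-contained sketch of the same join logic using a hand-made three-row mini lexicon (invented for illustration; the real NRC lexicon has 13,872 rows):

```r
library(dplyr)

# Invented mini lexicon standing in for get_sentiments("nrc")
mini_lexicon <- tibble(
  word      = c("love", "love",     "fear"),
  sentiment = c("joy",  "positive", "fear"))

toy_tokens <- tibble(word = c("love", "fear", "allah", "love"))

# inner_join() keeps only lexicon words, duplicating a token once per
# sentiment it carries; anti_join() shows what gets dropped
toy_sentiments <- toy_tokens %>% inner_join(mini_lexicon, by = "word")
toy_excluded   <- toy_tokens %>% anti_join(mini_lexicon, by = "word")

nrow(toy_sentiments)  # 5: each "love" token matches two sentiment rows
toy_excluded$word     # "allah" has no entry, just as in the real lexicon
```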
Then we'll create a new dataframe where we add the lexicon to our existing dataframe:
an_nrc <- an_stop_new %>%
inner_join(get_sentiments("nrc"))

By joining the lexicon to the dataframe, we exclude words that don't have a "sentiment" value. It is worth checking which words are excluded:
an_exclude <- an_stop_new %>%
anti_join(get_sentiments("nrc"))
an_exclude_n <- an_exclude %>%
count(word, sort = TRUE)
head(an_exclude_n)

# A tibble: 6 × 2
word n
<chr> <int>
1 allah 4219
2 day 3199
3 night 2962
4 son 2102
5 ceased 1641
6 heard 1595
It is interesting that a word like "allah" has been excluded; one would expect it to carry a "sentiment" value.
Now we count how many words fit into each of the 10 different “sentiment” categories, and then we plot them:
an_nrc_n <- an_nrc %>%
count(sentiment, sort = TRUE)
ggplot(data = an_nrc_n, aes(x = sentiment, y = n)) +
geom_bar(stat = "identity",
#color = "darkred",
#fill = "#D60103"
) +
theme_light() +
ggtitle("Number of words by sentiment") +
xlab("Sentiment") + ylab("Number of words")

From this we can see that a large share of the words carry a positive "sentiment".
Since we are interested in words like "love" and others related to it, let's look at which "sentiment" categories the word "love" belongs to:
love <- get_sentiments(lexicon = "nrc") %>%
filter(word == "love")  # lexicon words are lower-case, so an exact match suffices
love

# A tibble: 2 × 2
word sentiment
<chr> <chr>
1 love joy
2 love positive
We can now see that the word "love" is in both the "joy" and "positive" categories. Therefore we take a closer look at those two categories, to see which other words are in them and whether they are relevant. We focus on "joy" first:
The first step is to create a new dataframe that only contains the words in the “joy” category:
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")

Then we count the words and sort them by frequency, keeping only the 15 most frequent. We then plot those 15 words ordered by frequency:
nrc_joy_sort <- an_stop_new %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE) %>%
head(15)
ggplot(data = nrc_joy_sort, aes(x = reorder(word, -n), y = n)) +
geom_bar(stat = "identity", color = "darkred", fill = "#D60103") +
theme_light() +
ggtitle("Frequency of words in 'joy' sentiment") +
xlab("Word") + ylab("Number of words")
It's interesting to see that apart from "love", words like "youth" and "beauty" are also relatively frequent.
Now we do the same for the category “positive”. We first create a new dataframe:
nrc_positive <- get_sentiments("nrc") %>%
filter(sentiment == "positive")

Then we count the words and sort them by frequency, again limiting the plot to the 15 most frequent ones:
nrc_positive_sort <- an_stop_new %>%
inner_join(nrc_positive) %>%
count(word, sort = TRUE) %>%
head(15)
ggplot(data = nrc_positive_sort, aes(x = reorder(word, -n), y = n)) +
geom_bar(stat = "identity", color = "darkred", fill = "#D60103") +
theme_light() +
ggtitle("Frequency of words in 'positive' sentiment") +
xlab("Word") + ylab("Number of words")
Here we can see that the word "king" is much more frequent than any of the other positive words.
Now we take a look at which stories words such as "love" and "marriage" occur most frequently in. This can help us focus on reading the most relevant tales.
First, we want to split the book into smaller sections to get more precise results when we look for passages with a high frequency of our target words. Ideally, we would split the book by separate tales, but finding a regex for that is complex, so we instead split it by separate nights:
an_df_tidy %>%
mutate(
linenumber = row_number()) %>%
filter(str_detect(text, regex("Now when it w*as the .+ Night", ignore_case = TRUE)))

# A tibble: 1,000 × 3
gutenberg_id text linenumber
<int> <chr> <int>
1 51252 " Now when it was the Second Night," 923
2 51252 " Now when it was the Third Night," 1174
3 51252 " Now when it was the Fourth Night," 1388
4 51252 " Now when it was the Fifth Night," 1598
5 51252 " Now when it was the Sixth Night," 1947
6 51252 " Now when it was the Seventh Night," 2084
7 51252 " Now when it was the Eighth Night," 2351
8 51252 " Now when it was the Ninth Night," 2553
9 51252 " Now when it was the Tenth Night," 2975
10 51252 " Now when it was the Eleventh Night," 3337
# … with 990 more rows
When you run this, you should get a tibble with 1,000 rows. We can only capture 1,000 nights, since the first night isn't written out as a header, so this is as close as we can get.
We now make it into a new dataframe, where each night has a number associated with it:
tidy_stories <- an_df_tidy %>%
mutate(linenumber = row_number(), chapter = cumsum(str_detect(text, regex("Now when it w*as the .+ Night", ignore_case = TRUE)))) It is important to note that the first story has the number “0” and the second story has the number “1” and so on. Therefore to get the correct night, you have to take the number you get in the code, and add 1 to it. When we find out which nights the words are most frequent in, then we can look up which story the night belongs to, and then read it. we can find a list over stories, with night numbers here: https://en.wikipedia.org/wiki/List_of_stories_within_One_Thousand_and_One_Nights
Now we want to find out which night has the most frequent use of the word "love". We use the pattern "love" instead of "\blove\b", because the looser pattern also catches words like "lovers" and "lovely". We also limit the output to the 20 nights with the most frequent use of variations of the word "love":
tidy_love <- tidy_stories %>%
filter(str_detect(text, regex("love", ignore_case = TRUE))) %>%
count(chapter, sort = TRUE) %>%
head(20)

Then plot the result:
tidy_love %>%
ggplot(aes(x = reorder(chapter, -n), y = n)) +
geom_bar(stat = "identity", color = "darkred", fill = "#D60103") +
theme_light() +
ggtitle("Nights with the most frequent use of variation of the word 'love'") +
xlab("Night") + ylab("Number of words")

We can then pull out the lines belonging to the top night:

tidy_stories %>%
filter(chapter == 845)

# A tibble: 436 × 4
gutenberg_id text linen…¹ chapter
<int> <chr> <int> <int>
1 55091 " Now when it was the Eight Hundred and Forty-sixt… 107877 845
2 55091 "" 107878 845
3 55091 "She continued, It hath reached me, O auspicious King, th… 107879 845
4 55091 "merchant awoke, he strave with his yearnings till mornin… 107880 845
5 55091 "to himself, “There is no help but that I go this day to … 107881 845
6 55091 "will expound to me this vision.” So he went forth and wa… 107882 845
7 55091 "left, till he was far from his dwelling-place, but found… 107883 845
8 55091 "interpret the dream to him. Then he would have returned,… 107884 845
9 55091 "behold, the fancy took him to turn aside to the house of… 107885 845
10 55091 "trader, a man of the wealthiest, and when he drew near t… 107886 845
# … with 426 more rows, and abbreviated variable name ¹linenumber
Here we can see that night 846 (chapter 845 plus one) has the most frequent use of the word "love". That is the tale of Masrur and Zayn Al-Mawasif.
Another pattern we can look at is "marr", which catches "marriage", "married", and similar words. Let's repeat the two previous steps, replacing "love" with "marr":
tidy_marr <- tidy_stories %>%
filter(str_detect(text, regex("marr", ignore_case = TRUE))) %>%
count(chapter, sort = TRUE) %>%
head(20)
tidy_marr %>%
ggplot(aes(x = reorder(chapter, -n), y = n)) +
geom_bar(stat = "identity", color = "darkred", fill = "#D60103") +
theme_light() +
ggtitle("Nights with the most frequent use of variation of the word 'marr'") +
xlab("Night") + ylab("Number of words")
Here we can see that night 978 has the most frequent use of variations of "marr". That is the tale of Miriam the Girdle Girl.
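A caveat worth knowing: a loose pattern like "marr" also matches unrelated words. A quick check on a few invented examples:

```r
library(stringr)

words <- c("marriage", "married", "marred", "marrow", "market")
str_detect(words, regex("marr", ignore_case = TRUE))
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE
```

"marred" and "marrow" are false positives here, so counts based on such patterns are best treated as a reading guide rather than an exact measure.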
Looking at the raw count of a single word in different sections of a book is one way to measure word frequency, but such counts can be skewed by the length and diversity of words used within each section. It is therefore more robust to look at the count of one specific word relative to the total number of words in a night. To do that we perform a term frequency analysis, which is explained here: https://www.tidytextmining.com/tfidf.html
The first step is to tokenize the dataframe once again, this time using the version where each night (tale) has a number assigned to it. We again count all the words, but this time the counts are split by night number:
tidy_stories_untoken <- tidy_stories %>%
unnest_tokens(word, text) %>%
count(chapter, word, sort = TRUE)

Then we count how many words there are in each night (chapter):
total_words_stories <- tidy_stories_untoken%>%
group_by(chapter) %>%
summarize(total = sum(n))

Then we join those counts into a new dataframe that contains "chapter", "word", "n" and "total":
chapter_words <- left_join(tidy_stories_untoken, total_words_stories)

Lastly we calculate the percentage of target words in each night. We filter for the target word, use the mutate() function to create a new column with the word's frequency in percent, and arrange the rows by frequency to see which night has the highest share of target words:
chapter_words %>%
filter(str_detect(word, regex("love", ignore_case = TRUE))) %>%
mutate(frequency =((n/total)*100)) %>%
arrange(desc(frequency))

# A tibble: 1,583 × 5
chapter word n total frequency
<int> <chr> <int> <int> <dbl>
1 855 love 17 1131 1.50
2 860 love 9 610 1.48
3 180 love 8 628 1.27
4 861 love 12 1035 1.16
5 786 love 14 1225 1.14
6 371 love 10 881 1.14
7 179 beloved 8 739 1.08
8 196 love 6 572 1.05
9 886 love 17 1663 1.02
10 373 love 11 1111 0.990
# … with 1,573 more rows
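The frequency arithmetic can be checked on an invented two-chapter count table; the group_by()/mutate() variant below computes the same n/total share as the left_join() approach above, just without the separate totals table:

```r
library(dplyr)

# Invented per-chapter word counts (chapter, word, n)
toy_counts <- tibble(
  chapter = c(1, 1, 2, 2),
  word    = c("love", "king", "love", "king"),
  n       = c(2, 8, 3, 7))

toy_freq <- toy_counts %>%
  group_by(chapter) %>%
  mutate(total = sum(n),                    # words in the chapter
         frequency = (n / total) * 100) %>% # share of the chapter, in percent
  ungroup() %>%
  filter(word == "love") %>%
  arrange(desc(frequency))

# chapter 2 ranks first: 3/10 = 30%, versus 2/10 = 20% for chapter 1
```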
In this way we can see that the night with the highest share of "love" relative to its total word count is night 856 (chapter 855 plus one).
It would also be interesting to have a look at the word “sex”. Does it even appear in the Arabian Nights?
tidy_stories %>%
filter(str_detect(text, regex("sex", ignore_case = TRUE)))

# A tibble: 31 × 4
gutenberg_id text linen…¹ chapter
<int> <chr> <int> <int>
1 51252 " say which be the better and the nicer, mine or his.\" S… 7955 23
2 51775 " Arab. \"Farj\"; hence a facetious designation of the o… 14599 44
3 51775 "neither accost one of our sex, be she young or be she ol… 23180 119
4 51775 " societies where the sexes are separated speech becomes… 23708 123
5 51775 " beauty or conditions of the \"veiled sex.\" Even publi… 24704 123
6 51775 " and pronounced so by Egyptians. It is worn by both sex… 25006 123
7 51775 " women prefer their own sex. These tribades are mostly … 25364 123
8 51775 " the law of unlikeness which mostly governs sexual unio… 25834 123
9 52564 "with any of her sex. Know that she who wrought these gaz… 27981 127
10 52564 "and proportions strongly excited her desires sexual. So … 29048 133
# … with 21 more rows, and abbreviated variable name ¹linenumber
Now that we see the context of "sex", we see that it mainly appears in relation to gender and in some of the footnotes we haven't been able to trim away.
A word that is often used as a metaphor for sex and the woman’s body is “pomegranate”. Let’s look where that appears in the Arabian Nights:
tidy_stories %>%
filter(str_detect(text, regex("pomegranate", ignore_case = TRUE)))

# A tibble: 68 × 4
gutenberg_id text linen…¹ chapter
<int> <chr> <int> <int>
1 51252 " pomegranate-bloom, eglantine and narcissus, and set the… 2644 8
2 51252 " pomegranates of even size, stood at bay as it were;[146… 2689 8
3 51252 " and crept into a huge red pomegranate,[250] which lay b… 4346 13
4 51252 " pomegranate swelled to the size of a water-melon in air… 4348 13
5 51252 " beginning. I had no trouble till the time when the pome… 4404 13
6 51252 " its leafy screen pomegranate hides from sight:" 7160 21
7 51252 " He had just dressed a conserve of pomegranate grains wi… 7643 22
8 51252 " ladled into a saucer some conserve of pomegranate-grain… 7676 22
9 51252 " confection of pomegranate-grains. When the twain drew n… 7836 22
10 51252 " pomegranate-grains. Said Ajib, \"Sit thee down and eat … 7865 22
# … with 58 more rows, and abbreviated variable name ¹linenumber
From this we can see that the word "pomegranate" shows up in the Arabian Nights sixty-eight times in different contexts. The contexts mostly seem non-sexual, but there are some examples of its use as a sexual metaphor, e.g. rows 31 and 38. To inspect the immediate context of words more systematically, we can tokenize the text into bigrams and trigrams:
an_bigrams <- tidy_stories %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram))
an_trigrams <- tidy_stories %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
filter(!is.na(trigram))
an_trigrams %>%
filter(str_detect(trigram, regex("pomegranate", ignore_case = TRUE)))

# A tibble: 155 × 4
gutenberg_id linenumber chapter trigram
<int> <int> <int> <chr>
1 51252 2644 8 pomegranate bloom eglantine
2 51252 2689 8 pomegranates of even
3 51252 4346 13 huge red pomegranate
4 51252 4346 13 red pomegranate 250
5 51252 4346 13 pomegranate 250 which
6 51252 4348 13 pomegranate swelled to
7 51252 4404 13 when the pomegranate
8 51252 4404 13 the pomegranate burst
9 51252 7160 21 leafy screen pomegranate
10 51252 7160 21 screen pomegranate hides
# … with 145 more rows
As one might expect, many of the most common bigrams are pairs of common (uninteresting) words, such as "of the" and "to be": what we call stop words. This is a good time to use tidyr's separate(), which splits a column into multiple columns based on a delimiter. This lets us split each bigram into two columns, "word1" and "word2", at which point we can remove cases where either is a stop word. Then we unite the two columns again and filter for the target term:
bigrams_separated <- an_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word & !word1 %in% my_stop_words$word) %>%
filter(!word2 %in% stop_words$word & !word2 %in% my_stop_words$word)
# new bigram counts:
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
# search for pomegranate
bigrams_united %>%
filter(str_detect(bigram, regex("pomegranate", ignore_case = TRUE))) %>%
count(bigram, sort = TRUE)

# A tibble: 28 × 2
bigram n
<chr> <int>
1 pomegranate grains 22
2 pomegranate tree 3
3 pomegranates twain 3
4 twin pomegranates 3
5 conserved pomegranate 2
6 cooked pomegranate 2
7 pomegranate flower 2
8 pomegranate hides 2
9 pomegranate seed 2
10 fruits pomegranate 1
# … with 18 more rows
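The separate/filter/unite round trip can be run end to end on a single invented sentence:

```r
library(dplyr)
library(tidyr)
library(tidytext)

data(stop_words)  # bundled with tidytext

# Invented one-sentence text for illustration
toy <- tibble(text = "the red pomegranate grew beside the garden wall")

toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%   # split on the space
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%                # drop stop-word pairs
  unite(bigram, word1, word2, sep = " ")                 # glue back together

# "red pomegranate" survives; every bigram containing "the" is gone
```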
Bigrams and trigrams with their frequencies can also be visualized as networks, which helps in assessing relationships and topic groups. While that is beyond the scope of this script, you can explore network building further at https://www.tidytextmining.com/ngrams
This tutorial was inspired by Laura Bang Jensen’s (2022) final project on the Arabian Nights.
tidytext is a toolbox packed with text-mining functions for structuring and analyzing text.

gutenbergr is a library that facilitates the download of digitized texts from the Gutenberg Project archives.

sessionInfo()

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggwordcloud_0.5.0 textdata_0.4.4 tidytext_0.3.4 here_1.0.1
[5] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.9 purrr_0.3.4
[9] readr_2.1.2 tidyr_1.2.0 tibble_3.1.8 ggplot2_3.3.6
[13] tidyverse_1.3.2 unilur_0.4.0.9100 knitr_1.40
loaded via a namespace (and not attached):
[1] httr_1.4.4 sass_0.4.2 jsonlite_1.8.2 modelr_0.1.9
[5] bslib_0.4.0 assertthat_0.2.1 highr_0.9 googlesheets4_1.0.1
[9] cellranger_1.1.0 yaml_2.3.5 pillar_1.8.1 backports_1.4.1
[13] lattice_0.20-45 glue_1.6.2 digest_0.6.29 rvest_1.0.3
[17] colorspace_2.0-3 htmltools_0.5.3 Matrix_1.5-1 pkgconfig_2.0.3
[21] broom_1.0.0 haven_2.5.1 scales_1.2.1 tzdb_0.3.0
[25] googledrive_2.0.0 generics_0.1.3 farver_2.1.1 ellipsis_0.3.2
[29] cachem_1.0.6 withr_2.5.0 cli_3.3.0 magrittr_2.0.3
[33] crayon_1.5.1 readxl_1.4.1 evaluate_0.16 tokenizers_0.2.3
[37] janeaustenr_1.0.0 fs_1.5.2 fansi_1.0.3 SnowballC_0.7.0
[41] xml2_1.3.3 tools_4.2.2 hms_1.1.2 gargle_1.2.0
[45] lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.2 compiler_4.2.2
[49] jquerylib_0.1.4 rlang_1.0.4 grid_4.2.2 rstudioapi_0.14
[53] rappdirs_0.3.3 labeling_0.4.2 rmarkdown_2.16 gtable_0.3.0
[57] DBI_1.1.3 R6_2.5.1 lubridate_1.8.0 fastmap_1.1.0
[61] utf8_1.2.2 rprojroot_2.0.3 stringi_1.7.8 Rcpp_1.0.9
[65] vctrs_0.4.1 png_0.1-7 dbplyr_2.2.1 tidyselect_1.1.2
[69] xfun_0.32